I had read many articles on machine learning and heard everyone else talking about it, but still didn't know where to start. Then one day, while tuning the accuracy of my Hacker News crawler, it suddenly occurred to me: what I was doing was figuring out the best parameters of a formula that tells how likely a DOM element is to be the main content. Isn't that exactly what's called training in machine learning terminology? Wouldn't it be a good idea to feed all those features into a classifier model and let it do the tuning to find the best parameters, instead of me doing it manually?
That was my "aha!" moment with machine learning, and also the first time I solved a real-world problem with it. So here are the notes I took while falling down the rabbit hole. I'll try to explain things from a newbie's point of view (my own, actually), hoping others may get inspired.
Getting the main content of an HTML page is super easy if you have done some crawler-related work before.
Let's take a page of The Times as an example: write an XPATH or CSS selector for the article body, apply that rule with Beautiful Soup or whatever library you prefer, and you are done. Using this one rule you can scrape the entire The Times site and get the main content of every article. Pretty straightforward, right?
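For instance, a minimal sketch with Beautiful Soup might look like the following. The URL and the CSS selector are made up for illustration; the real selector has to be read off The Times' markup.

import requests
from bs4 import BeautifulSoup

# Hypothetical example: one hand-written rule for one site.
url = 'https://www.thetimes.co.uk/article/some-article'   # placeholder URL
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# A CSS selector pointing at the article body; 'div.article-body' is illustrative only.
article = soup.select_one('div.article-body')
if article is not None:
    print(article.get_text(strip=True))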
However, what if you have hundreds of different pages to extract content from? Write down their XPATHes and chain hundreds of if...else branches to apply those rules? Nah, you would go crazy maintaining them all. So extracting the main content of "any" page is hard. How does Hacker News Digest solve this?
The method is based on the assumption that the DOM element with the most text is most likely to be the main content. Using that rule alone, however, would lead to the html element being recognized as the main content, because it contains everything, so we also need to take DOM depth into consideration. Besides, elements dense with <a> links, which tend to be ads, comments, sidebars, and so on, should have their weight reduced. Combining all those rules, we can score every DOM element in a page, and the one with the highest score wins as the main content. The following factors are currently used to score a DOM element:
- whether it contains an h element with a high similarity to the title of the page; most blogs put the title of the article into the <title> element for SEO purposes
- whether its class or id contains strings like article, content, or ad-, comment, etc.
- whether it contains an <img>, and the area ratio of that image
- the density of its <a> links; if all children of an element are <a> links, it's more likely to be a menu or an ad, and less likely to be the main content

After quantifying those factors, I use a linear formula to calculate the score, so the problem becomes: how do I pick the right weight for each factor? The most naive approach is to set them from experience, test against some pages, and adjust accordingly. After many trials and errors, I finally found a set of weights that works for most of the pages posted on Hacker News, including news websites, blogs, GitHub readmes, etc.
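To make the linear-formula idea concrete, here is a rough sketch of such a scoring function. The feature calculations are simplified and the weights are invented for illustration; they are not the ones the crawler actually uses.

def score_dom(dom, page_title, page_text_len):
    # dom is assumed to be a BeautifulSoup Tag.
    text = dom.get_text(strip=True)
    link_text_len = sum(len(a.get_text(strip=True)) for a in dom.find_all('a'))

    text_ratio = len(text) / max(page_text_len, 1)         # share of the whole page's text
    alink_text_ratio = link_text_len / max(len(text), 1)   # how link-heavy the element is
    depth = len(list(dom.parents))                         # distance from the document root
    contain_title = any(page_title in h.get_text() for h in dom.find_all(['h1', 'h2']))

    # Linear combination with made-up weights.
    return (5.0 * text_ratio           # more text -> more likely the main content
            + 0.5 * depth              # keeps the all-containing <html>/<body> from winning
            + 2.0 * contain_title      # a heading echoing <title> is a strong hint
            - 3.0 * alink_text_ratio)  # link-heavy blocks look like menus, ads, comments

# The element with the highest score wins:
# best = max(soup.find_all(True), key=lambda d: score_dom(d, title, total_text_len))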
The hand-tuned weights work so well that my crawler keeps sending me digests of Hacker News posts with high accuracy. But whenever I want to improve it or add more rules, I need to be careful and add more test cases to stabilize my changes, as it's becoming harder and harder to keep the balance between the old working weights and new ones. As you may have guessed, this is where machine learning kicks in!
Now that I know which factors contribute to the likelihood of a DOM element being the main content, why not let machine learning find the best weights for those factors? It has a wide variety of models and well-optimized algorithms for figuring out exactly that.
Time to get our hands dirty with some code
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC
In [2]:
dataframe = pd.read_csv('/tmp/features.csv')
dataframe.head()
Out[2]:
- The target column indicates whether this DOM element is the main content or not, tagged by me manually.
- The depth, text_ratio and alink_text_ratio columns are the three key features I mentioned earlier. Note that they have been normalized so they are comparable across all kinds of web pages. Take text_ratio as an example: if DOM A in one page has 500 words and DOM B in another page has only 100 words, you cannot conclude that DOM A is more likely than DOM B to be the main content, because they come from different pages!
- The attr column is a set of strings, namely the tag name, class and id. Some of them, like article and content, are likely to make a positive contribution, while others, like comment and ad, may contribute negatively; which is which, I'll leave to the machine learning algorithm to figure out.

To convert those attr strings into numbers, I'm using CountVectorizer, which tokenizes a text document and counts the word occurrences, so the most frequent attributes end up with a higher value (count).
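As a quick illustration of what CountVectorizer produces, here is a toy run on two made-up attr strings (the attribute values are hypothetical):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

toy_attrs = ['div article content', 'div comment sidebar ad']  # invented examples
vc = CountVectorizer()
counts = vc.fit_transform(toy_attrs)
# get_feature_names() is called get_feature_names_out() in newer scikit-learn versions
print(pd.DataFrame(counts.toarray(), columns=vc.get_feature_names()))
#    ad  article  comment  content  div  sidebar
# 0   0        1        0        1    1        0
# 1   1        0        1        0    1        1

The next cell applies the same idea to the real attr column.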
In [3]:
y = dataframe.target
X = dataframe.drop(['target'], axis=1)
corpus = X['attr']
vc = CountVectorizer()
vc.fit(corpus)
numeric_features = pd.concat(
    [X.drop(['attr'], axis=1),
     pd.DataFrame(vc.transform(corpus).toarray(),
                  # get_feature_names() keeps the column names aligned with the count matrix;
                  # vc.vocabulary_ is an unordered dict and would mislabel the columns
                  # (newer scikit-learn versions call this get_feature_names_out())
                  columns=vc.get_feature_names())],
    axis=1)
numeric_features.head()
Out[3]:
In [5]:
plt.scatter(dataframe.index, dataframe.target, color='red', label='target')
plt.scatter(numeric_features.index, numeric_features.depth, color='green', label='depth')
plt.scatter(numeric_features.index, numeric_features.text_ratio, color='blue', label='text_ratio')
plt.scatter(numeric_features.index, numeric_features.alink_text_ratio, color='skyblue', label='alink_text_ratio')
plt.legend(loc=(1, 0))
plt.show()
Train and predict routine

Here, for simplicity, I'll just verify against the training dataset itself. But in the real world you should split your data into training and test sets, and use cross-validation to evaluate the model's performance (see the sketch after the cell below).
As shown in the code, I also tried several classifiers, Naive Bayes and Random Forest among them, and finally chose an SVM with a polynomial kernel. I cannot explain why it works better than the others; maybe a polynomial model simply fits my case better.
In [4]:
# Scale the features to zero mean and unit variance before feeding them to the SVM.
scaler = preprocessing.StandardScaler()
scaler.fit(numeric_features)
scaled_X = scaler.transform(numeric_features)

# clf = MultinomialNB()
# clf = RandomForestClassifier()
clf = SVC(C=1, kernel='poly', probability=True)
clf.fit(scaled_X, y)

# Index of the first row predicted to be the main content
# (raises ValueError if the model predicts True for nothing).
predicted_index = clf.predict(scaled_X).tolist().index(True)

pred_y = clf.predict(scaled_X)
print(pd.DataFrame(clf.predict_log_proba(scaled_X), columns=clf.classes_))
print('Number of mispredicted out of %d is %d (%.2f%%)' % (
    y.shape[0], (y != pred_y).sum(), (y != pred_y).sum() * 100.0 / y.shape[0]))
print()
print('Predicted rows:')
print(dataframe[pred_y]
      .drop(['text_ratio', 'alink_text_ratio', 'contain_title'], axis=1)
      .merge(pd.DataFrame(clf.predict_log_proba(scaled_X)[pred_y],
                          columns=clf.classes_,
                          index=dataframe[pred_y].index),
             left_index=True, right_index=True))
# print('Actual rows:')
# print(dataframe[dataframe.target])
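As promised above, here is a sketch of what a more honest evaluation could look like, using a held-out test set and cross-validation. The split ratio and fold count are arbitrary choices, and the numbers will vary with your own data.

from sklearn.model_selection import cross_val_score, train_test_split

# Hold out part of the data so the model is scored on rows it has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    scaled_X, y, test_size=0.3, random_state=42)

clf = SVC(C=1, kernel='poly', probability=True)
clf.fit(X_train, y_train)
print('Held-out accuracy: %.2f%%' % (100.0 * clf.score(X_test, y_test)))

# 5-fold cross-validation averages out a lucky or unlucky single split.
scores = cross_val_score(SVC(C=1, kernel='poly'), scaled_X, y, cv=5)
print('Cross-validated accuracy: %.2f%% (+/- %.2f)' % (scores.mean() * 100, scores.std() * 100))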
So we now have a working classifier that can tell us which DOM element in a page is most likely to be the main content, in just under 20 lines of code! Along the way I also learnt that applying machine learning is not as obvious as it sounds: you first need to convert your problem into one a model can solve, and that is where machine learning does its best work.
This probably shouldn't be used in production, because it's missing loads of things (like a proper evaluation of the model's performance, and error handling for when the model thinks none of the DOM elements is the main content, etc.), but I hope it has given you the idea that machine learning is not that intimidating, and shown how it can be used to solve a real-world problem.